Scan  Directed  Load  Balancing  for 
Highly- Parallel  Mesh-Connected 
Computersi 

Edoardo  S.  Biagioni 
Jan  F.  Prins 

Department  of  Computer  Science 
University  of  North  Carolina 
Chapel  Hill,  N.C.  27599-3175  USA 
biagioni@cs.unc.edu 
prins@cs.unc.edu 

Abstract 

Scan  Directed  Load  Balancing  is  a  new,  locality-preserving,  dynamic 
load  balancing  algorithm  for  grid-based  computations  on  mesh-connect¬ 
ed  parallel  computers.  Scans  are  used  to  efficiently  determine  what  areas 
of  the  machine  are  heavily  loaded  and  what  areas  are  lightly  loaded, 
and  to  organize  the  movement  of  data.  Data  is  shifted  along  the  mesh 
in  a  regular  fashion  to  balance  the  load.  The  Locality  Property  of  the 
algorithm  guarantees  that  all  the  neighbors  of  a  data  point  on  the  grid 
are  stored  either  on  the  same  processor,  or  on  a  processor  that  is  directly 
connected  to  it. 

Scan  Directed  Load  Balancing  is  applicable  to  both  SIMD  and  M1MD 
mesh-connected  parallel  computers,  and  has  been  implemented  on  the 
MasPar  MP-1.  We  present  some  theoretical  bounds  achieved  by  the 
algorithm  as  well  as  the  algorithm’s  performance  on  a  particular  im¬ 
age  processing  problem,  edge-directed  diffusion.  Our  experiments  show 
that  the  algorithm  is  effective  in  improving  the  load  distribution  for  real 
problems,  while  the  efficiency  of  the  original  grid-based  computation  is 
preserved  by  the  locality  property. 
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Introduction 


A  large  class  of  scientific  and  engineering  problems  can  be  solved  by 
repeated  local  computations  at  every  point  of  data  arranged  in  some 
regular  grid.  A  classic  example  is  the  explicit  solution  of  differential 
equations  such  as  those  describing  the  diffusion  of  heat  through  a  sur¬ 
face.  Other  examples  include  the  simulation  of  cellular  automata  and 
operations  on  image  data.  In  each  case  the  computation  at  a  given  grid 
element  is  local  because  it  depends  only  on  the  element  and  its  nearby 
neighbors  on  the  grid. 

A  mesh-connected  parallel  computer  is  characterized  by  processors  ar¬ 
ranged  in  a  regular  grid  with  an  interconnection  network  providing  local 
communication  between  adjacent  processors  in  the  grid.  Global  com¬ 
munication  between  arbitrary  processors  is  achieved  by  routing  data 
over  successive  links  in  the  network.  Global  communication  delays  scale 
with  the  number  of  processors  whereas  local  communication  delays  are 
not  subject  to  such  a  relationship.  Consequently  grid-based  computa¬ 
tions  that  are  local  in  nature  are  ideally  suited  for  execution  on  mesh- 
connected  computers  when  the  data  grid  can  be  put  into  correspondence 
with  the  mesh  structure  of  the  machine.  Indeed  the  earliest  demonstra¬ 
tions  of  success  with  scalable  parallel  computers  were  on  such  problems: 
heat  diffusion  on  Illiac  IV  [13]  and  Ising  spin  computations  on  the  DAP 
[II]. 

In  this  paper  we  are  concerned  with  techniques  to  efficiently  execute  a 
class  of  such  local  computations  in  a  setting  where  the  data  grid  is  large 
relative  to  the  number  of  processors,  and  the  local  computations  exhibit 
fixed  points  which,  when  attained,  do  not  require  further  recomputation 
while  data  values  in  the  local  neighborhood  remain  unchanged. 

Since,  in  a  computation  on  a  large  grid,  each  processor  is  responsible 
for  multiple  data  points,  the  presence  of  local  fixed  points  makes  possible 
a  more  efficient  execution  technique  in  which  each  processor  only  evalu¬ 
ates  “active”  points  in  the  portion  of  the  grid  for  which  it  is  responsible. 
The  problem  arises,  however,  that  the  distribution  of  active  points  may 
become  unpredictable  as  the  computation  evolves  and  may  be  highly 
irregular.  Consequently,  a  static  assignment  of  grid  points  to  processors 
may  lead  to  uneven  numbers  of  active  data  points  at  each  processor,  re¬ 
ducing  the  efficiency  of  the  parallel  computation.  In  the  worst  case  one 
processor  holds  all  the  active  points,  and  no  parallel  speedup  is  attained 
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Symbol 

Value  or  Definition 

Explanation 

k 

z 

i... 

0...k-  1 

dimension  of  the  mesh  and  data 
index  over  dimensions 

Pk 

P»1 

number  of  processors 

«'» 

0...P-  1 

processor  index  along  dimension  z 

/ 

[io, ....  U-i] 

processor  index 

U 

f(P“  1) . (P  “  1)1 

maximum  processor  index 

Pt 

•  .  • 

processor  at  index  I 

M») 

.  .  . 

hyperplane  pi  where  /*  =  i 

A, 

•  .  . 

active  points  on  processor  pi 

A  ,ot 

EiAi 

number  of  active  points 

nk 

•  .  . 

number  of  data  points 

h. 

0...B-  1 

data  index  along  dimension  z 

H 

I*o . Afc-i] 

data  point  index 

Dh 

... 

data  point  at  index  H 

p{H) 

pH-! 

mapping  of  data  points  to  processors 

bk 

•  «  • 

buffet  space  per  processor 

*M0 

1.0. 

max  data  index  on  x-hyperplane  i 

9x 

l — *r«C*) 

index  over  data  points  in  processors  i 

•  .  • 

data  hyperplane  at  index  gx  in  pr(>) 

c 

[ffo,...,  J*-l] 

data  point  index  within  a  processor 

HU.V) 

z.\u.  -  v,\ 

Manhattan  distance  between  U  and  V 

HV) 

[Vit{VtV)m\} 

indices  of  neighbors  of  index  V 

mar,(/l) 

M  AX  jgu-..)Aj 

max-reduction  across  axis  z  of  A 

maXj(A) 

activity  of  data  z-hyperpiane  (1,9) 

S:., 

activity  of  processor  z-hyperplane  1 

A 

maxx(A)/p 

average  load  per  processor  Ityperplane 

D 

nk/p 

points  per  processor  per  dimension 

MV) 

Zj<.V> 

left-lo-right  plus-scan  of  V 

L, 

activity  over  hyperplanes  <  » 

F, 

A  •  »  —  L, 

active  point  flow,  1  to  »  —  1 

a 

Al(bk  -  D) 

load  equivalent  of  inactive  data  point 

Table  i 
Notation 

at  all. 

To  maintain  parallel  efficiency,  a  dynamic  load  balancing  strategy 
can  be  used  to  redistribute  the  data  grid  over  processors  periodically 
in  response  to  the  observed  distribution  of  active  points.  Dynamic  load 
balancing  strategies  have  been  studied  extensively  (e.g.  (5), [3])  for  large- 
grain  computations  in  small-scale  parallel  systems  where  communication 
is  not  a  limiting  issue.  In  such  a  setting,  migration  of  a  computation 
away  from  the  data  it  references  is  not  considered  to  be  a  critical  issue. 
In  a  scalable  parallel  system,  however,  this  migration  leads  to  the  loss  of 
the  local  communication  property,  reducing  the  scalability  of  the  com¬ 
putation.  This  paper  presents  a  dynamic  load-balancing  technique  that 
preserves  locality  of  communication  in  the  computation. 

The  technique  is  applicable  to  mesh-connected  MIMD  or  SIMD  ma¬ 
chines,  including  boolean  hypercube  machines,  although  the  implemen¬ 
tations  reported  on  in  this  paper  are  for  the  MasPar  MP-1,  a  two- 
dimensional  mesh-connected  SIMD  machine.  The  load  balancing  com¬ 
putation  uses  global  information  to  balance  the  load  over  the  full  ma¬ 
chine;  all  global  communication  is  in  the  form  of  scans  (parallel  prefix 
operations),  which  are  efficiently  implemented  on  scalable  machines  (2). 

The  rest  of  this  paper  is  organized  as  follows.  The  next  section  de¬ 
scribes  our  assumptions  and  gives  some  definitions  and  notation  (in  Ta¬ 
ble  1)  to  be  used  in  subsequent  sections.  Section  3  describes  Scan  Di¬ 
rected  Load  Balancing  in  one-dimensional  arrays  of  processors,  and  Sec¬ 
tion  4  extends  Scan  Directed  Load  Balancing  to  k  dimensions.  Section  5 
presents  some  experimental  results  from  two-dimensional  load  balancing 
in  an  image  processing  problem.  The  last  section  reviews  the  charac¬ 
teristics  of  Scan  Directed  Load  Balancing  and  points  out  directions  for 
future  research. 

Decomposition  of  Data  into  Processors 

We  assume  the  data  D  is  arranged  in  a  t-dimensional  grid  in  which  every 
axis  has  length  n  for  a  total  of  N  =  nk  elements,  and  that  the  computa¬ 
tion  will  be  carried  out  on  a  t-dimensional  mesh-connected  machine  M 
with  P  =  pk  processors,  with  p  <  n  (for  p  >  n  load  balancing  is  trivial). 

The  dimensionality  of  the  target  machine  mesh  is  the  same  as  the  di¬ 
mensionality  of  the  data  grid  for  simplicity  of  exposition.  Standard  mem- 


ory  virtualization  techniques  can  be  used  [6]  to  simulate  a  /fc-dimensional 
mesh  efficiently  in  a  lower-dimensional  mesh,  and  grid  embedding  strate¬ 
gies  exist  [7]  to  simulate  a  ^-dimensional  mesh  in  a  higher  dimensional 
mesh  (such  as  a  boolean  hypercube).  A  grey-code  embedding  insures 
that  the  constant  cost  for  local  communication  in  the  t-dimensional 
mesh  is  preserved.  These  methods  and  the  techniques  to  be  presented 
are  also  readily  generalized  to  handle  a  rectangular  processor  mesh  or 
data  grid. 

A  domam  decomposition  p  maps  each  element  index  H  =  (h0 . h*_] ] 

of  the  grid  D  to  a  processor  index  p(H)  in  M.  The  hierarchical  decom¬ 
position  n(H)i  =  |hj£J  partitions  D  into  ^-dimensional  subcubes  of 
approximately  equal  volume  (to  be  precise,  each  side  of  the  subcube  has 
length  [n/pj  or  \n/p\).  This  decomposition  minimizes  communication 
in  the  evaluation  of  a  local  function  applied  at  every  point  on  the  data 
grid. 

The  hierarchical  decomposition  is  an  instance  of  an  orthogonal  de¬ 
composition.  A  decomposition  p  of  D  is  orthogonal  if  it  can  be  writ¬ 
ten  as  p([h0,  ...,hk_|])  =  [po(Ao),...,p4-i(ht>i],  where  p0 . >4-1  are 

functions  with  domain  0..n  -  I  and  range  Q.  p—  l.  In  other  words,  an 
orthogonal  decomposition  maps  each  dimension  of  the  data  grid  inde¬ 
pendently.  If  each  of  the  functions  p,  is  monotonically  nondecreasing 
and  surjective,  then  neighboring  points  on  the  data  grid  will  always  be 
found  in  the  same  or  neighboring  processors  in  the  machine. 

Scan  Directed  Load  Balancing  maintains  an  orthogonal  decomposition 
p  that  varies  in  response  to  the  measured  distribution  of  active  cells. 

We  can  restate  these  definitions  more  formally  using  the  notation  in 
Table  1.  Each  step  of  the  compulation  we  are  load  balancing  is  of  the 
form  D'J1  as  /(D'ff,  for  some  function  f,  where  H  ranges  over 

the  data  point  indices,  and  i '(H)  is  the  set  of  indices  which  form  the 
neighborhood  of  H .  At  all  times  we  will  want  the  decomposition  p  of  D 
to  respect  the  following  property. 

Locality  Property:  The  processor  distance  between  Dh  and  each  o; 
the  2*  points  in  Dv(H)  is  either  0  or  1. 

We  assume  that  the  grid  is  initially  decomposed  hierarchic^'ly  into 
the  machine. 


Lemma  1  Given  n  >  p,  the  hierarchical  decomposition  in  which  index 
H  —  is  assigned  to  processor  p/,  where  1  s  [ii  ...it]  and  ir  = 

|A,*J,  has  the  Locality  Property. 

Proof  The  processor  distance  between  the  data  point  with  index  H  and 
any  one  of  its  neighbors  with  index  H‘  is  6({i(H),n(H'))  =  6(1,  /'). 
From  the  definition  of  neighbors  (Table  1)  we  know  that  6(H,H')  =  1, 
so  3x  :  ((h*  —  /i'r|  =  1)  A  (Vy  ^  z,(hy  =  h'y)).  From  the  definition 
of  ir  and  from  hy  s  h'y,  it  follows  that  iy  =  =  L^  ynJ  =  *V 

Therefore,  6(1,1')  =  |i*  —  i',|  =  |[/»r£j  —  [/»'r£j|.  Since  n  >  p,  either 
6(1,  I')  s  0  or  6(1, 1')  s  lrand  the  locality  property  holds.  □ 

The  following  conditions  on  p  are  preserved  by  the  load  balancing 
algorithm,  and  are  sufficient  to  ensure  that  the  decomposition  has  the 
locality  property. 

For  any  data  index  H  ~  [ho,  ....  and  any  axis  0  <  x  <  Jfc 

(1)  hx  <p(//% 

(2)  ht  =  h'z  =>  fi(H),  =  p(/f')« 

and  for  any  processor  index  I 

(3)  3//:p(//)s7 

Figure  X 

Locality  Constraint! 


Lemma  2  If  the  locality  constraints  hold,  p( H)t  <  n(H')r  =>  hz  <  li'r. 

Proof  Given  p(//)r  <  p(//')r,  h't  <  hx  violates  constraint  1,  and  hT  = 
h'x  violates  constraint  2;  therefore,  n(H)x  <  fi(H')x  =>  hT  <  h‘t.  □ 

Theorem  1  If  the  Locality  Constraints  are  satisfied,  the  Locality  Prop¬ 
erty  holds. 

Proof  If  Theorem  1  didn’t  hold,  it  would  be  possible  to  destroy  the 
locality  property  without  violating  any  of  the  constraints.  Then  there 


I 


would  be  at  least  two  data  points,  Dh  on  processor  I  —  jj(H)  and 
its  neighbor  Dh •  on  processor  I'  =  such  that  (a)  6(H,H')  = 

l(bydefinitionofi/)  and  (b)  6(1,  /')  >  1.  Then  either  (1)  /  and  /'  differ 
only  in  dimension  x  so  that  Vy  ^  x,  (iv  =  i'y),  or  (2)  I  and  /'  differ  in 
at  least  two  dimensions  x  and  y. 

If  (1),  assume  without  loss  of  generality  that  »r  <  »*.  By  (b),  i'r-iz  > 
1,  so  there  is  some  processor  J,  ix  <  jT  <  i't,  that  by  Property  3  holds  at 
least  one  data  point  Dh",  H"  such  that  J  —  Then,  by  Lemma  2, 

hr  <  h"t  <  h't,  and  6(H,H')  >  1,  which  contradicts  (a). 

Conversely  if  (2)  holds,  I  and  I'  are  different  in  at  least  two  dimen¬ 
sions  x  and  y.  By  definition  of  6,  since  6(H,H')  =  1,  H  and  H'  differ 
in  at  most  one  dimension,  x,  so  either  r  ^  r  or  y  ^  r  or  both.  If  x  96  z 
then  hx  =  h'x  violates  Constraint  2  since  fi(H)x  ^  likewise  if 

y£z. 

Therefore,  the  Locality  Property  is  preserved  by  any  algorithm  that 
respects  the  Locality  Constraints.  □ 

One-Dimensional  Scan  Directed  Load  Balancing 

This  section  describes  Scan  Directed  Load  Balancing  for  cases  in  which 
fc  =  1,  so  the  computer  is  a  linear  array  of  processors  and  the  data  is  a 
linear  array  of  data  points. 

The  code  for  one-dimensional  Scan  Directed  Load  Balancing  has  two 
steps,  shown  in  Algorithm  1.  The  first  step  does  storage  computation, 
by  computing  the  difference  between  the  old  mapping  /  =  p(//)  and 
the  new  mapping  /'  =  fi'(H).  More  exactly,  what  is  computed  is  the 
number  of  active  points  that  must  be  transferred  between  every  pair  of 
contiguous  processors;  we  call  this  quantity  the  flow  of  active  pomis 
between  processors.  A  scan  gives  the  load  to  the  left  of  each  processor; 
the  difference  between  tins  and  the  desired  load  gives  F, ,  the  flow  of  ac¬ 
tive  points  between  processors.  The  second  step  does  data  movement, 
shifting  the  data  from  one  processor  to  another  until  each  data  point 
Dh  is  in  processor  The  function  put()  puts  the  leftmost  data 

point  of  each  active  p<  onto  the  rightmost  end  of  the  data  buffer  of  p, _  j ; 
the  function  get()  takes  the  rightmost  data  point  from  p, _  1  and  stores 
it  as  the  leftmost  point  in  p,-. 

In  the  computation  step,  the  number  of  active  data  points  that  must 


1.  Compute  flow* 

Li  —  ffi(A)  activity  to  left  of  processor  i 

A  —  (Ip_ i  +  Ap„i)/p  A  broadcast  to  all  processors 

if  (A  <  1)  then  abort  load  balancing  endif  avoid  underflows 
Fi  *—  iA  —  Li  flow  from  processor  i  to  i  —  l 


2.  Shift  data 

forall  (i)  in  parallel 
while  (Fi  >  1/2) 

if  active  (Do)  than  F, 
put  () 

shift-lwft  Di 
endwhile 

while  (Fi  <  -1/2) 
shift-right  Di 
gwt  () 

if  active  (Do)  then 
endwhile 
endfor 


tested  by  each  processor 
Fi  —  1  endif 

shift  own  data  buffer  left  by  1 

shift  own  data  buffer  right  by  1 
Fi  —  F,  +  1  endif 


Figure  3 

Algorithm  1.  One-dimensional  Scan  Directed  Load  Balancing 
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Example  of  l-dimensional  load  balancing  on  a  linear  array  of  3  processors,  showing 
initial  distribution  (top)  and  final  distribution  (bottom) 


flow  through  the  left  boundary  of  each  processor  is  computed  from  the 
processor  index,  the  average  number  of  active  data  points  per  processor, 
and  the  number  of  active  data  points  to  the  left  of  the  processor.  The 
broadcast  operation  by  which  A  is  communicated  to  all  processors  can 
be  implemented  by  a  right-to-left  copy  scan.  The  computation  step 
checks  that  the  average  load  is  at  least  one;  if  it  is  less  than  one,  any 
adjacent  active  data  points  would  be  distributed  over  more  than  two 
processors,  violating  the  third  locality  constraint,  so  the  load  balancing 
is  aborted.  The  computation  step  for  a  simple  example  is  shown  in  detail 
in  Figure  3,  together  with  the  results  of  the  storage  movement  step. 

The  storage  movement  step  has  two  phases.  First,  processors  with 
F,  >  0  shift  points  to  the  left  until  F,-  active  points  have  been  shifted. 
Then,  processors  with  Fj  <  0  take  points  from  the  left  until  -F,  active 
data  points  have  been  shifted.  Since  Fo  and  Fp  are  both  zero,  after  the 
shift  there  are  Li  +  F<  ss  iA  (actually,  [iAj  <  +  F*  <  fiA])  active 

points  to  the  left  of  each  processor  i,  at  which  point  the  load  is  balanced. 
If  only  active  points  are  shifted,  no  processor  will  ever  have  to  hold  more 
active  points  than  it  already  has  or  fA"),  whichever  is  greater,  as  long 
as  all  processors  with  Fj  >  1/2  shift  on  every  cycle  in  the  first  phase  of 
data  movement,  and  all  processors  with  —  F*  <  —1/2  do  likewise  in  the 
second  phase. 

Although  Algorithm  1  achieves  perfect  load  balance  to  within  integer 
constraints,  there  is  no  limitation  on  the  number  of  inactive  data  points 
that  may  end  up  in  a  single  processor.  For  example,  if  all  the  active  data 
is  in  one  half  of  the  machine  and  an  equal  amount  of  inactive  data  is  in 
the  other  half  of  the  machine,  an  even  load  distribution  will  push  all  the 
inactive  data  into  the  single  processor  at  the  inactive  end,  which  can  be 
a  problem  since  the  per-processor  memory  of  highly  parallel  computers 
is  often  limited. 

Some  programs  have  an  uneven  distribution  of  data  points  but  an  even 
load  per  data  point,  and  these  programs  will  balance  storage  by  balanc¬ 
ing  the  load.  Other  programs  have  an  even  distribution  of  points  but 
an  uneven  distribution  of  load  per  data  point,  and  for  these,  prevention 
of  buffer  overflows  is  important.  Algorithm  2  avoids  buffer  overflows  by 
giving  a  load  value  to  inactive  data  points;  this  insures  that  no  processor 
will  have  more  than  6  total  points,  where  6  is  the  maximum  number  of 
points  that  a  processor  can  store. 

Algorithm  2  is  similar  to  Algorithm  1.  The  computation  phase  first 


1.  Compute  floes 

Li  —  <Xi(A)  activity  to  left  of  processor  i 

A  *—  ( Lp-i  +  Ap.\)/p  A  computed  on  processor  p—  1 

if  (^4  <  1)  then  abort  load  balancing  endif  avoid  underflows 


a  _  A/{b  -  D) 

p—  1  computes,  broadcasts  a 

A[  «—  Ai  +  aDi 

compute  modified  load 

L\  —  <?i(A') 

new  load  to  left  of  processor  i 

A’-L^Jp 

A'  =  Ab/(b  -  D) 

Fi  -  iA'  -  L\ 

flow  from  processor  i  to  i  —  1 

Shift  data 

for all  (i)  in  parallel 

while  (F,  >  (1  +a)/2) 

tested  by  each  processor 

if  active  ( D0 )  then  F, 

-  F,-  -  (1  +  Q) 

else  F{  —  Fi  —  a  endif 
put  () 

shift-left  Di  shift  own  data  buffer  left  by  1 

endvhile 

while  ( Fi  <  -(1  +a)/2) 

shift-right  Di  shift  own  data  buffer  right  by  1 

get  () 

if  active  (£?0)  then  f,  «—  F,  +  (1  +  a) 
else  Fi  —  Fj  +  a  endif 
endehile 
endf or 


Figure  4 

Algorithm  2.  One-dimensional  Load  Balancing  with  Limited  Puffers 


computes  a,  the  pseudo-load  value  for  inactive  processes.  The  value  of 
a  is  selected  to  be  such  that  if  any  processor  has  b  inactive  processes, 
its  total  pseudo-load  is  exactly  A',  or,  conversely,  any  processor  with 
pseudo-load  A'  and  no  active  processes  has  exactly  6  data  points.  Since 
any  processor  with  pseudo-load  A'  and  any  active  points  has  fewer  than 
6  points,  this  guarantees  that  no  processor  will  ever  need  to  store  more 
than  b  points. 

The  only  difference  between  the  storage  movement  phases  of  the  two 
algorithms  is  the  amount  by  which  the  flows,  /),  are  updated  when 
shifting  out  an  active  or  an  inactive  data  point. 

Algorithm  1  computes  a  redistribution  that  gives  a  perfectly  balanced 
load  (within  integer  constraints)  of 

[A\  <  Ai  <  fAl 


since  A  —  0.5  <  A,-  <  A  +  0.5  and  A  is  an  integer. 
Algorithm  2,  on  the  other  hand,  only  guarantees  that 


A'  - 


1  +  a 
2 


<  A'  <  A'  + 


1  4-  a 
2 


where  A'  =  A+aD.  If  each  active  point  has  a  load  of  1,  the  pseudo-load 
of  an  active  point  is  1  +  o  and  Ai  <  A',/(  1  +o).  Since  o  =  A/(b  -  D), 
if  b  D,  a  is  small  and  the  performance  of  the  two  algorithms  is 
comparable;  if  D  is  close  to  b,  the  maximum  load  on  any  processor  may 
be  noticeably  higher  than  the  average  load. 

The  exact  bound  for  A,  in  terms  of  A,  o,  and  D  is 


,  .  A  +  qD  +  (1  +  a)/2 

Ai  <  - — - 

1  -+■  o 

which  can  be  reformulated  to  exclude  the  artificial  factor  a  and  give 

,  Ab  +  (A  +  b-  D)/2 
A  +  b-D 

If  most  data  points  are  active,  A  ss  D  and  the  bound  reduces  to  .4,  < 
A  +  l/2,  which  is  the  bound  for  Algorithm  I.  The  worst-case  performance 
is  obtained  when  b  as  D,  in  which  case  A,  <  D ,  the  best  bound  that 
can  be  given,  simply  means  very  little  redistribution  is  taking  place  and 
a  processor  may  happen  to  have  all  of  its  D  data  points  active.  This 
corresponds  to  intuition,  since  if  little  or  no  buffer  space  is  available,  load 


balancing  is  unable  to  contribute  much  to  performance  because  there  is 
no  flexibility  in  distributing  data. 

Finally,  the  next  theorem  shows  that  the  above  load  balancing  algo¬ 
rithms  satisfy  the-  Locality  Constraints  and  therefore  preserve  the  Lo¬ 
cality  Property. 

Lemma  3  The  put()  and  get()  operations  preserve  the  first  Locality 
Constraint. 

Proof  The  definition  of  put()  says  that  put()  transfers  the  leftmost 
data  point  on  one  processor  onto  the  rightmost  end  of  the  data  buffer 
of  the  neighbor  to  its  left;  this  preserves  property  1,  since  each  buffer 
in  each  processor  acts  like  a  FIFO  queue  and  no  data  point  can  ever 
“pass”  another  data  point.  The  same  is  true  of  get(),  which  proves  the 
lemma.  □ 

Theorem  2  Algorithm  1  preserves  the  Locality  Property. 

Proof  To  prove  that  Algorithm  1  preserves  the  Locality  Property,  as¬ 
sume  that  the  distribution  to  which  Algorithm  1  is  applied  already  sat¬ 
isfies  the  Locality  Constraints.  The  first  constraint  is  preserved  b;,  Al¬ 
gorithm  1  since  the  only  data  movement  operations  are  performed  by 
put()  and  get(),  which  by  Lemma  3  preserve  the  locality  constraints. 
The  second  constraint  is  trivially  true  in  one  dimension  because  there  are 
no  dimensions  orthogonal  to  the  axis  along  which  the  load  is  balanced. 
The  third  constraint  is  satisfied  by  avoiding  underflow:  after  the  load  is 
balanced  each  processor  has  a  load  between  f/i],  and  [Aj;  if  1  <  ]_.4J, 
each  processor  at  the  end  of  the  balance  will  have  at  least  one  ac-..e 
data  point,  thus  satisfying  constraint  3.  In  the  case  that  A  <  1,  no  load 
balancing  is  done  and  constraint  3  is  preserved.  □ 

Lemma  4  Algorithm  2  preserves  the  Locality  Property. 

Proof  Constraints  1  and  2  are  satisfied  as  for  Algorithm  1.  Constraint  3 
is  satisfied  because  the  load  on  each  processor  is  at  least  .4'  —  ( 1  +  o)/2; 
since  A'  =  A  +  a  and  A  >  1,  the  load  is  at  least  1  +  a  —  ( 1  +  a)/2  = 
(1  +  a)/ 2  >  0.  Since  every  processor  has  a  nonzero  pseudo- load,  every 
processor  has  at  least  one  data  point.  □ 


While  interest  in  one-dimensional  scan  directed  load  balancing  might 
seem  academic,  there  are  actually  many  problems  that  can  use  just  such 
a  scheme.  Simpler  variants  of  Algorithm  1  have  been  used  for  garbage 
collection  and  storage  management  on  both  the  Connection  Machine 
[2]  and  on  the  FFP  Machine  [9]  [1].  In  addition,  Algorithms  1  and  2 
can  be  used  on  any  computer  composed  of  a  linear  array  of  processors; 
any  multi-dimensionai  data  set  can  be  mapped  to  such  a  computer  by  a 
suitable  function. 

Scan  Directed  Load  Balancing  for  Higher  Dimensional 
Meshes 

The  algorithms  for  one-dimensional  Scan  Directed  Load  Balancing  can 
be  used  with  slight  changes  for  meshes  of  dimension  k  >  1.  In  this 
case  both  the  processor  mesh  and  data  grid  have  the  same  dimen¬ 
sion  k.  Multidimensional  Scan  Directed  Load  Balancing  simply  applies 
the  one-dimensional  algorithm  to  each  dimension  of  the  grid  indepen¬ 
dently.  When  balancing  dimension  x,  for  instance,  the  algorithm  treats 
the  Jb-dimensional  data  as  a  single  linear  array  of  ( k  —  l)-dimensional 
hyperplanes  and  balances  the  load  by  shifting  along  x  so  that  each  hy¬ 
perplane  of  processors  has  the  same  load. 

The  algorithms  use  local  addresses  for  the  data  points,  so  a  point  on 
processor  pj  is  identified  by  the  index  G  within  the  processor,  so  that  the 
pair  ( /,  G)  uniquely  identifies  a  data  point.  A  hyperplane  of  processors 
is  identified  by  the  single  index  i  and  the  dimension  x  to  which  it  is 
perpendicular,  whereas  a  hyperplane  of  data  is  identified  by  the  two 
indices  ( i,g )  and  the  dimension  x. 

The  algorithm  first  reduces  the  loads  on  the  processors  to  a  single 
vector  ofloads  by  computing  the  hyperplane  loads  g)  for  each  data 
hyperplane;  Wx(i, g)  is  the  maximum  number  of  active  point s  with  x- 
coordinate  ( i,g )  within  any  single  processor.  Wx(i,g )  can  be  computed 
using  segmented  max-scans.  The  vector  Sx,i  gives  the  load  of  each  x- 
hyperplane  of  processors. 

Once  a  single  load  has  been  computed  for  each  hyperplane  of  proces¬ 
sors,  each  hyperplane  is  treated  as  a  point  in  the  one-dimensional  load 
balancing  algorithm,  as  can  be  seen  by  comparing  Algorithm  3  to  Algo¬ 
rithm  1.  The  main  difference  is  that  in  the  one-dimensional  case  each 


lor  all  (z€{l..Jt}) 

1.  Compute  Hobs  in  dimension  z 

lorall  ( g  €  {0...r;r(t)} )  loop  over  data  indices 

W<tJ  *—  maxx(Ax(g))  activity  on  data  hyperplane  dx(i, g) 

end! or 

Si  *—  Sj(I Vitt)  activity  on  hyperplane  pr(i) 

L{  —  ffi(Si)  activity  over  hyperplanes  px(j),  j  <  i 

A  *-  (Lp_ j  +Sp_i)/p  broadcast  to  all  processors 

il  (A  <  maz(hx))  then  abort  endil  avoid  underflows 

Fi  —  iA  —  Li  flow  from  hyperplane  px(t)  to  pr(i  -  1) 

2.  Shift  data  along  dimension  z 
lorall  (i)  in  parallel 

while  ( Fi  >  Wi% 0/2)  tested  by  each  processor 

Fi  —  Fi  -  IV,  ,0 

put  ()  put  hyperplane  dx(i,  0)  to  processors  pr(:  -  1 ) 

shift-doun  D<,  W,  shift  to  fill  hyperplane  dr(i,  0) 

endshile 

%  -  load  of  topmost  data  of  pr(/  —  1) 

while  (Fi  <  —Up/2)  tested  by  each  processor 

shilt-up  D, ,  Wi  shift  to  clear  hyperplane  dx(i,0) 

get  ()  get  hyperplane  dr(i  —  1,  U'p)  into  dx(i,  0) 

F,  —  Fi  +  U^to 

lVp  —  U',.], rj.r. - 1 )  px(«  —  1)  may  or  may  not  call  get  () 

endwhile 
endlor 
endlor 
Figure  5 

Algorithm  3.  Scan  Directed  Load  Balancing  for  more  than  one  dimension 


point  has  a  load  of  either  0  or  1,  whereas  in  the  multi-dimensional  case 
each  hyperplane  of  data  may  have  a  load  considerably  greater  than  1; 
the  loop  termination  and  underflow  prevention  conditions  of  Algorithm  3 
reflect  this  difference.  Note  that  the  functions  get  ()  and  put  ()  now 
move  hyperplanes  of  data  (rather  than  data  points)  between  processors. 

Algorithm  4  is  the  multi-dimensional  analogue  of  Algorithm  2,  and 
a  straightforward  extension  of  Algorithm  3,  and  is  omitted  for  brevity. 
Each  processor  is  assumed  to  have  a  i-dimensional  buffer  for  data  of 
size  bk. 

The  underflow  condition  in  Algorithms  3  and  4  compares  the  active 
load  to  the  maximum  number  of  data  points  in  a  cell  over  all  data 
hyperplanes  perpendicular  to  the  axis  being  balanced.  As  long  as  the 
average  load  is  greater  than  this  number,  no  data  hyperplane  will  ever 
need  to  be  distributed  to  more  than  one  hyperplane  of  processors,  which 
preserves  the  third  Locality  Constraint. 

Unlike  the  one-dimensional  case,  the  improvements  in  load  balancing 
that  are  given  by  the  multi-dimensional  algorithm  cannot  be  as  neatly 
bounded  as  in  the  one-dimensional  case.  The  reason  for  this  is  that  the 
algorithm  evens  out  the  row  and  column  loads  independently,  and  in  the 
worst  case  this  may  not  be  effective  in  optimizing  the  load  per  proces¬ 
sor.  An  obvious  worst  case  is  when  the  entire  load  is  evenly  distributed 
among  the  processors  in  one  diagonal  of  the  mesh:  since  no  amount  of 
orthogonal  shifts  is  going  to  improve  the  load  balance,  Scan  Directed 
Load  Balancing  computes  flows  of  zero.  Such  worst  cases  are  rare  in 
practice,  as  can  be  shown  by  our  experiments. 

Experimental  Data 

An  example  of  a  computat  ion  to  which  these  techniques  can  be  applied  is 
the  edge-directed  diffusion  operation  in  image  processing  [12].  Edge- 
directed  diffusion  operates  on  image  data  arranged  as  a  grid  of  intensity 
values  (pixels).  At  each  iteration  a  pixel’s  value  is  set  to  be  a  weighted 
average  of  the  values  of  its  neighbors  and  itself,  with  neighbors  whose 
value  differs  greatly  (i.e.  those  defining  an  edge)  having  low  weight. 
Whenever  the  averaging  operation  yields  a  value  that  is  within  some 
threshold  e  of  the  current  value,  the  value  is  not  updated  and  the  local 
averaging  function  has  achieved  a  fixed  point. 


When  applied  to  an  image  such  as  the  one  in  Figure  6,  most  pixels  are 
modified  on  early  iterations.  Thereafter,  activity  tends  to  cluster  into  re¬ 
gions  surrounded  by  edges,  spreading  slowly  between  regions.  The  com¬ 
putation  terminates  after  some  predetermined  length  of  time  or  when 
all  pixels  have  reached  a  fixed  point.  The  result  on  the  sample  image  is 
shown  in  Figure  7.  The  algorithm  smooths  noise  in  areas  of  the  image 
that  don’t  have  sharp  edges,  but  does  not  blur  the  edges  themselves,  so 
the  overall  sharpness  of  the  image  is  preserved.  Edge-directed  diffusion 
is  used  at  the  University  of  North  Carolina  [4]  for  removing  noise  from 
medical  images  such  as  those  produced  by  CT  scanners. 

Scan  Directed  Load  Balancing  has  been  used  to  optimize  edge-directed 
diffusion.  The  algorithm  has  been  coded  to  run  on  the  MasPar  MP-1 
[10],  a  parallel  SIMD  computer  with  4096  processors  laid  out  in  a  64  x  64 
2-dimensional  mesh.  Each  processor  of  the  MP-1  has  a  4-bit  wide  ALU 
and  connections  to  8  neighbors,  4  in  the  orthogonal  and  4  in  the  diagonal 
directions. 

Even  though  the  MP-1  provides  a  general-purpose  router  which  is 
much  faster  for  sending  small  amounts  of  data  long  distances,  the  rela¬ 
tive  speed  and  bandwidth  of  the  mesh  connection  is  much  greater,  so  the 
diffusion  program  uses  only  the  mesh  for  communication.  The  image  is 
divided  into  tiles  and  the  program  performs  the  diffusion  independently 
on  each  tile  after  copying  the  pixels  from  the  edges  of  each  of  the  8 
neighboring  processors.  The  subdivision  of  the  image  into  contiguous 
rectangular  tiles  minimizes  the  distance  over  which  communication  oc¬ 
curs  and  the  total  bandwidth  of  communication,  so  that  significantly 
more  time  is  spent  computing  the  diffusion  than  communicating  pixel 
values.  Under  these  circumstances,  load  balancing  can  be  useful  in  im¬ 
proving  the  performance  of  the  program,  as  long  as  speedup  given  by 
load  balancing  is  greater  than  the  additional  overhead  due  to  load  bal¬ 
ancing. 

The  diffusion  program  was  run  on  32  medical  and  test  image  files  of 
sizes  512  x  512  pixels  (large),  400  x  448  pixels  (medium),  and  256  x  256 
pixels  (small).  On  the  64  x  64  processor  MP-1  this  gives  per-processor 
image  sizes  of  8  x  8,  6  x  8,  and  4x4  pixels  respectively. 

For  each  image,  the  performance  of  the  diffusion  algorithm  alone  was 
compared  to  the  performance  of  the  diffusion  algorithm  with  load  bal¬ 
ancing.  The  two  measures  of  performance  we  use  are  the  maximum 
load  on  any  given  processor  averaged  over  all  diffusion  cycles,  and  the 


Figure  6 

The  original  image 


Figure  7 

The  image  afler  running  Edge  Directed  Diffusion 


actual  time  it  took  to  complete  the  diffusion.  The  first  measure  is  an 
indication  of  how  successful  the  load  balancing  algorithm  is  in  reducing 
the  load  of  the  processors.  The  second  measure  takes  into  account  the 
costs  of  performing  the  load  balancing  and  is  a  more  accurate  measure  of 
the  success  of  this  implementation  of  Scan  Directed  Load  Balancing  in 
improving  overall  performance.  Since  different  implementations  might 
be  more  or  less  efficient  than  the  present  one,  the  first  measure  is  more 
useful  in  assessing  the  algorithm's  potential  for  improving  performance, 
whereas  the  second  measure  gives  the  actual  performance  of  the  current 
implementation  on  the  available  hardware.  Figures  8-10  give  the  actual 
measurements. 

Each  graph  shows  the  performance  of  the  load  balancing  algorithm 
for  Scan  Directed  Load  Balancing  applied  every  iteration  of  the  diffusion 
algorithm,  every  other  iteration,  every  fifth  iteration,  and  so  on.  The 
vertical  axis  gives  the  ratio  of  the  cost  using  load  balancing  to  the  cost 
without  load  balancing. 

The  first  observation  is  that  there  is  a  very  wide  spectrum  of  per¬ 
formances,  from  abysmal  to  very  good.  It  was  mentioned  at  the  end 
of  the  previous  section  that  performance  cannot  be  predicted  for  the 
multi-dimensional  case;  an  additional  problem  is  that  present  load  is 
not  always  an  accurate  predictor  of  future  load,  and  so  the  algorithm 
occasionally  makes  the  load  worse  than  it  was. 

On  average,  the  algorithm  is  quite  effective  in  reducing  the  load  More 
important,  the  average  load  decreases  monotonically  as  the  algorithm  is 
applied  more  and  more  frequently,  which  indicates  that  the  algorithm 
reliably  improves  the  load  balance. 

The  average  time  is  a  shallow  V-shaped  curve,  since  more  frequent 
applications  of  the  algorithm  mean  not  just  better  performance  on  the 
diffusion  task  but  also  higher  overhead  for  the  load  balancing.  For  the 
present  implementation,  the  optimal  ratio  of  diffusion  to  load  balancing 
cycles  is  in  the  neighborhood  of  five.  The  next  section  discusses  ad¬ 
ditional  hardware  that  could  be  provided  to  make  it  advantageous  to 
run  load  balancing  on  every  diffusion  cycle  and  to  reap  all  the  potential 
benefits  of  load  balancing. 

Another  interesting  comparison  is  between  images  of  different  sizes. 
A  comparison  of  the  graphs  for  the  performance  on  the  small  and  large 
images  shows  that  both  the  time  and  the  load  are  better  on  the  large 
images.  The  better  load  performance  on  large  images  can  be  explained 


diffusions  per  load  balancing 


by  considering  that  the  granularity  of  load  balancing  is  finer  on  the  larger 
images  and  the  load  balancing  algorithm  therefore  can  allocate  the  load 
more  accurately.  The  better  time  performance  on  larger  images  is  due 
to  both  the  finer  granularity  and  the  fact  that  each  diffusion  cycle  takes 
longer,  so  the  overhead  of  load  balancing  is  less  significant  than  on  the 
smaller  images. 

The  better  time  performance  on  the  one-dimensional  load  balancing 
is  due  in  part  to  less  overhead  for  the  load  balancing  itself,  and  in  part 
to  reduction  in  the  overhead  of  loop  processing.  The  MP-1  provides 
local  indirect  addressing,  which  is  used  to  build  a  worklist,  but  worklist 
processing  is  a  noticeable  fraction  of  the  processing  time,  and  applying 
the  load  balancing  algorithm  in  only  one  dimension  brings  many  of  the 
benefits  of  Scan  Directed  Load  Balancing  while  reducing  many  of  the 
costs  of  two-dimensional  load  balancing.  The  average  load  is  higher  in 
the  one-dimensional  case,  however,  as  might  be  expected  from  the  fact 
that  not  all  the  benefits  of  load  balancing  can  be  realized. 

Summary  and  Concluding  Remarks 

The  organized  synchronous  character  of  Scan  Directed  Load  Balancing 
is  a  quite  different  from  the  distributed,  demand-driven  nature  of  many 
popular  load  balancing  algorithms  [8]  [3].  Scan  Directed  Load  Balancing 
does  not  require  any  kind  of  tuning  and  provides  predictable  performance 
for  any  range  of  loads.  This  predictable  performance  is  particularly 
useful  for  parallel  computers  with  large  numbers  of  nodes  and  small- 
grain  processes,  in  which  an  algorithm  that  isn’t  centrally  organized 
may  not  be  as  quick  at  evening  out  load  imbalances. 

Scan  Directed  Load  Balancing  uses  barrier  synchronization  and  syn¬ 
chronous  transfer  of  per-process  data  to  guarantee  that  no  shift  will  ever 
overflow  a  data  buffer,  and  pseudo-loads  for  inactive  processes  if  such 
processes  are  present.  The  pseudo-load  mechanism  trades  off  proces¬ 
sor  activity  for  processor  space  to  simultaneously  bound  both  load  and 
buffer  space.  This  algorithm  does  not  always  achieve  perfect  balance; 
interesting  open  questions  are  whether  the  optimal  distribution  that  re¬ 
spects  buffer  size  constraints  can  be  efficiently  computed  in  parallel,  and 
whether  the  optimal  distribution  has  better  worst-case  performance  than 
Algorithm  2. 


Scan  Directed  Load  Balancing  preserves  the  Locality  Property,  which 
simplifies  programming  of  many  problems  and  makes  it  easier  to  add 
load  balancing  to  existing  programs.  The  Locality  Constraints  guar¬ 
antee  nearest-neighbor  access  of  neighboring  data  points,  which  is  very 
well  suited  to  SIMD-style  conflict-free  access  of  data  over  a  mesh  for 
programs  that  do  mostly  Local  Communication.  Guaranteeing  the  Lo¬ 
cality  Property  by  the  Locality  Constraints  in  multi-dimensional  meshes 
can  lead  to  load  imbalances  that  can’t  be  corrected,  independently  of 
the  algorithm  used  to  balance  the  load.  Careful  study  of  the  Locality 
Constraints  to  reduce  this  problem  may  yield  new  algorithms  that  might 
be  less  generally  applicable  (since  the  SIMD  style  of  mesh  usage  would 
have  to  be  abandoned)  but  have  better  worst-case  performance. 

The  measured  performance  of  Scan  Directed  Load  Balancing  shows 
that  such  worst  cases  are  rare  in  practice,  and  that  the  algorithm  as  it 
stands  can  effectively  reduce  the  load  of  an  average  computation.  The 
measured  performance  also  shows  that  the  present  implementation  is 
too  slow  to  produce  significant  improvements  for  the  anisotropic  diffu¬ 
sion  problem;  this  could  be  rectified  either  by  improving  the  software 
implementation  or  by  providing  specialized  hardware  support  for  scans 
and  shifts.  The  MasPar  MP-1  currently  uses  the  mesh  for  both  scans 
and  shifts;  since  these  mechanisms  are  crucial  to  the  performance  of 
Scan  Directed  Load  Balancing  (as  well  as  other  algorithms),  quantifying 
the  advantage  of  hardware-assisted  scans  or  shifts  is  a  very  important 
next  step  in  our  work.  Also  important  is  to  test  the  Load  Balancing 
algorithm  on  other  applications  so  the  performance  of  the  algorithm  it¬ 
self  can  be  evaluated  independently  of  the  specific  problem,  in  general, 
the  performance  of  Scan  Directed  Load  Balancing  should  improve  as  the 
grain  of  the  processes  increases,  and  since  the  processor  grain  in  edge 
directed  diffusion  is  quite  small  (a  diffusion  cycle  for  one  pixel  is  only 
a  few  operations),  it  is  reasonable  to  expect  that  the  algorithm  will  be 
effective  for  a  wide  variety  of  computations. 

Scan  Directed  Load  Balancing  is  a  minimalist  load  balancing  algo¬ 
rithm.  It  is  well-suited  for  both  SIMD  and  MIMD,  is  effective  for  both 
linear  arrays  and  multi-dimensional  meshes,  only  uses  scans  for  com¬ 
putation,  broadcasting  and  synchronization,  and  can  be  used  to  load 
balance  any  mesh-based  program  that  is  computation  intensive.  As  a 
result,  Scan  Based  Load  Balancing  is  useful  over  a  wide  range  of  ma¬ 
chine  architectures  and  computations,  and  especially  appropriate  for 


fine-grained  computations  on  massively  parallel  architectures. 
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